Abstract

We consider a cognitive radio network with multiple primary users (PUs) and one secondary user (SU), where a spectrum 
server is utilized for spectrum sensing and scheduling the SU to transmit over one of the PU channels opportunistically. One 
practical yet challenging scenario is when both the PU occupancy and the channel fading vary over time and exhibit temporal 
correlations. Little work has been done for exploiting such temporal memory in the channel fading and the PU occupancy 
simultaneously for opportunistic spectrum scheduling. A main goal of this work is to understand the intricate tradeoffs resulting 
from the interactions of the two sets of system states - the channel fading and the PU occupancy, by casting the problem as a 
partially observable Markov decision process. We first show that a simple greedy policy is optimal in some special cases. To build 
a clear understanding of the tradeoffs, we then introduce a full-observation genie-aided system, where the spectrum server collects 
channel fading states from all PU channels. The genie-aided system is used to decompose the tradeoffs in the original system into
multiple tiers, which are examined progressively. Numerical examples indicate that the optimal scheduler in the original system,
with observation on the scheduled channel only, achieves a performance very close to the genie-aided system. Further, as expected,
the optimal policy in the original system significantly outperforms randomized scheduling, pointing to the merit of exploiting the
temporal correlation structure in both channel fading and PU occupancy.
I. Introduction



Over the past decade, cognitive radio (CR) has been identified as one promising solution to ease the "spectrum scarcity"
associated with the traditional static spectrum allocation [1]-[4]. Going beyond the fixed and licensed spectrum allocation,
a secondary user (SU) can opportunistically access the spectrum owned by the primary users (PUs) in a CR network. This
paradigm shift from static to dynamic spectrum allocation has been shown to bring significant improvement in the spectrum
utilization, and hence the system's overall performance.

A fundamental principle enabling the cognitive capability is the SU's dynamic adaptation of its operating
parameters (such as power and frequency) according to the environmental variations over time. One such variation is the
channel fading. An i.i.d. flat fading model is often used in abstracting fading channels, but it fails to capture the temporal
channel memory observed in realistic scenarios [5]. An alternative model, namely the Gilbert-Elliott (GE) model [6], has been
widely used (see, e.g., [7]-[9]) to capture the temporal correlation in the fading process. Specifically, the GE model uses a
first-order Markov chain with two states: one representing a "good" channel where the user experiences error-free transmissions,
and the other representing a "bad" channel with unsuccessful transmissions.

Another variation to be considered is the PU's activity on the channels. Note that in a CR network, the SUs have a strictly
lower priority in the spectrum usage, and can only access the channels when the PUs are absent [1]-[4]. This unique spectrum
usage structure necessitates the inclusion of the channel's PU occupancy state in determining the channel's accessibility by
an SU.

In many of the existing works (see, e.g., [7], [8], [10]-[12]), only one set of the system states - either the channel fading,
or the PU occupancy - has been taken into consideration in developing spectrum access strategies by the SU. In this work, 
we take a step forward, and explore the utility of both the states for opportunistic channel access. Specifically, we consider 
a CR network consisting of multiple PU channels and one SU opportunistically accessing one of the PU channels at a time. 
A spectrum server is utilized to periodically schedule the SU to one of the channels for transmission. Worth noting is that 
the usage of the spectrum server is consistent with the recent FCC ruling on the use of a spectrum database in CR network 
operations [13]. Further, the spectrum server facilitates spectrum management, and enhances the scalability of the network [14].

Dynamic spectrum access in the presence of the temporal variations can be cast as a sequential control problem. We formulate 
this sequential control problem as a partially observable Markov decision process [15]. In this context, the spectrum server
makes scheduling decisions in terms of allocating a PU channel to the SU, based on the channel's PU occupancy state and
fading state. We model the channel fading by using a two-state first-order Markov chain, i.e., the GE model. On the other
hand, since the PU activity may possess a long temporal memory (see, e.g., [16], [17]), we develop an "age" model to capture
the temporal correlation structure of the PU occupancy state. 
Building on the above model, we examine the intricate tradeoffs resulting from the dynamic interaction of the system states.
Our main contributions can be summarized as follows:

• We study opportunistic spectrum scheduling by exploiting the temporal correlation structure in both the channel fading
and the PU occupancy states. This, to the best of our knowledge, has not been addressed systematically in the literature 
so far. 

• We show that the optimal scheduling involves a multi-tier "exploitation vs. exploration" tradeoff. For certain special cases, 
we establish the optimality of a simple greedy policy, and examine the intricacy of the fundamental tradeoffs. 

• To gain a better understanding of the tradeoffs for the general case, we introduce a full-observation genie-aided system, 
where the spectrum server collects channel fading states from all the PU channels. Using the genie-aided system, we 
decompose the multiple tiers of the tradeoffs, and examine them progressively. 

The rest of the paper is organized as follows. Section II introduces the basic setting and problem formulation in detail.
In Section III, we identify the fundamental tradeoffs and illustrate them via special cases. Section IV further examines the
tradeoffs by developing a genie-aided system that isolates the impact of channel fading and PU occupancy on the optimal
reward. In Section V, numerical results are presented where we evaluate and compare the performance of the optimal policy
in the original system with baseline cases. We also study the impact of the memory in the channel fading and PU occupancy
on the relative performances of various baseline cases. This is followed by concluding remarks in Section VI.

II. Problem Formulation 

A. Basic Setting 

We consider a CR network with one SU and N PUs.¹ Each PU is licensed to one of N independent channels, henceforth
identified as PU channels. A PU generates packets according to a stationary process, transmits over its channel if there are 
backlogged packets, and leaves upon the completion of the transmissions. The PU traffic activity is assumed to be identical 
and independent across channels. 

The SU, on the other hand, is backlogged with packets and opportunistically transmits these packets over the PU channels 
with the help of a spectrum server. Time is divided into two timescales: mini-slots and control slots, with each control slot consisting of K
mini-slots, as illustrated in Fig. 1. The length of each mini-slot is normalized to fit the transmission of one data packet of the
PU or the SU. At the beginning of each control slot, the spectrum server schedules the SU to the "best" PU channel that is 
expected to yield the highest average throughput for the SU. The SU then transmits packets in the scheduled channel, until 
it detects the return of a PU.² Upon such an event, the SU suspends transmissions until the beginning of the next control
slot, when the spectrum server re-schedules the SU to a PU channel based on most recent observations. At the end of each 
mini-slot when the SU transmitted a packet, it sends accurate feedback on the channel fading state (of the PU channel, as seen 
by the SU) corresponding to that mini-slot, to the spectrum server. The spectrum server uses this channel fading feedback, the 
PU traffic observations, along with the memory inherent in these processes to perform informed scheduling decisions at the 
beginning of the next control slot. We discuss the system model and the scheduling problem formulation in more detail in the 
following. 
Fig. 1. A sketch of the two timescale model.



B. Problem Formulation 

The opportunistic spectrum access at hand can be viewed as a sequential control problem, which we formulate as a partially 
observable Markov decision process. In the following, we introduce and elaborate the entities involved in the formulation. 

Channel occupancy: The usage pattern on each of the PU channels can be modeled as an ON-OFF process at the mini-slot
timescale, with ON denoting the busy state where the PU transmits data over the channel, and OFF the idle state where the PU
is absent. Channel occupancy is the idle or busy state of the PU channels. Let $o_{t,k}(n)$ be a binary random variable denoting
whether PU channel $n$, for $n \in \{1, \ldots, N\}$, is idle ($o_{t,k}(n) = 0$) or not ($o_{t,k}(n) = 1$) in the $k$th mini-slot of control slot $t$.
The corresponding idle probability is denoted by $\pi^o_{t,k}(n) = \Pr(o_{t,k}(n) = 0)$.

1 Each user is assumed to be a pair of transmitter and receiver.

2 This can be accomplished by incorporating collision detection by the SU at the mini-slot timescale. We also assume that PU arrivals coincide with the
mini-slot boundaries.

Idle/Busy age: The PU traffic is temporally correlated, i.e., the current occupancy state on each of the channels depends
on the history of the channel occupancy. We introduce the notion of "age," defined as follows, to characterize the occupancy
history:

The age of a PU channel is the number of consecutive mini-slots immediately preceding the current mini-slot, during which
the channel is in the same occupancy state as in the current mini-slot. The age is called the "idle age" if the channel is in
the idle state in the current mini-slot and the "busy age" otherwise. We use $x_t(n)$ to denote the age of channel $n$ at the beginning of
control slot $t$.

As noted earlier, we assume long memory in the PU occupancy state. Specifically, with the definition of age in place, we
adopt a family of functions monotonically decreasing in age, to denote the conditional probability that a channel will be idle
(or busy), given that it has been idle (or busy) for $x \ge 1$ mini-slots:

$$P_I(x) = \frac{1}{x^u + C_I}, \qquad P_B(x) = \frac{1}{x^u + C_B}, \qquad u = 1, 2, \ldots, \tag{1}$$

where $C_I$ and $C_B$ are normalizing constants taking positive values. Our occupancy model essentially imposes the following
realistic correlation structures: 1) the occupancy memory weakens with time, i.e., the impact of a past occupancy event on the
current occupancy state diminishes as time elapses since that event; 2) the conditional probability that the PU channel is busy
or idle now is purely a function of the length of time the channel has been in the most recent state, and is independent of the
channel occupancy history before the time of the latest transition to the most recent state. In light of this, the quantities $P_I$
and $P_B$ defined in (1) are sufficient for capturing the temporal correlation in the channels' PU occupancy state.
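
As a rough numerical illustration of the age model in (1), the following Python sketch evaluates $P_I(x)$ and $P_B(x)$ for a few ages. The function name `stay_probability` and the parameter values ($u = 1$, $C_I = 0.5$, $C_B = 2$) are hypothetical choices made purely for illustration; they are not taken from the paper.

```python
def stay_probability(x: int, u: int, c: float) -> float:
    """Probability of remaining in the current occupancy state after x
    mini-slots in that state, following the form 1 / (x**u + c) in (1)."""
    if x < 1 or u < 1 or c <= 0:
        raise ValueError("expects x >= 1, u >= 1 and c > 0")
    return 1.0 / (x ** u + c)


# Hypothetical parameters, chosen only for illustration.
U, C_IDLE, C_BUSY = 1, 0.5, 2.0

ages = list(range(1, 7))
p_idle = [stay_probability(x, U, C_IDLE) for x in ages]
p_busy = [stay_probability(x, U, C_BUSY) for x in ages]

for x, pi, pb in zip(ages, p_idle, p_busy):
    print(f"age={x}: P_I={pi:.3f}, P_B={pb:.3f}")
```

Under these assumed parameters both sequences shrink as the age grows, which is the "memory weakens with time" behavior described above.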

Channel fading model: At the end of each mini-slot after transmitting a packet, the SU measures the channel fading between
its transmitter and receiver on the scheduled channel, and feeds back this information to the spectrum server. Inspired by
recent works [7], [8], [10], we capture the memory in the fading (of the PU channel) between the SU's transmitter and receiver
using a two-state, first-order Markov chain, with state variations occurring at the mini-slot timescale. The Markov chain model
is i.i.d. across the PU channels. Each state of the Markov chain corresponds to the degree of decodability of the data sent
through the channel, where state 1 denotes full decodability and state 0 denotes zero decodability. Note that the states can also
be interpreted as a quantized representation of the underlying channel fading, in the sense that state 1 corresponds to "good"
channel fading, while state 0 corresponds to "bad" fading. The probability transition matrix of this Markov chain is given as:

$$\begin{bmatrix} 1 - r & r \\ 1 - p & p \end{bmatrix}, \tag{2}$$



where p is the conditional probability that the channel fading is good, given that it was good in the previous mini-slot; and r 
is the conditional probability that the channel fading is good, given that it was bad in the previous mini-slot. Throughout the 
paper, we will focus on the case when the fading channels are positively correlated, i.e., p > r. 

Belief of channel fading state: Denote by $\pi^s_{t,k}(n)$ the belief of the channel fading state (i.e., the probability that the fading is good) in the $k$th mini-slot of control slot $t$ on
channel $n$. As is standard [7], [15], the fading state belief is a sufficient statistic that characterizes the current channel fading
state as perceived by the SU. Further, let $f_{t,k}(a_t)$ be a binary random variable denoting the fading state feedback obtained at
the end of the $k$th mini-slot in control slot $t$ on the scheduled channel $a_t$. Also, define $T^L(\cdot)$, for $L \in \{0, 1, \ldots\}$, as the $L$-step
belief evolution operator, taking the form: for $\gamma \in (0, 1)$,

$$T^L(\gamma) = T(T^{L-1}(\gamma)), \tag{3}$$

with $T^0(\gamma) = \gamma$ and $T(\gamma) = \gamma p + (1 - \gamma) r$. Now, the update of the fading state belief is governed by the underlying Markov
chain model, and any new information obtained on the channel fading, i.e.:

$$\pi^s_{t,k+1}(n) = \begin{cases} p, & a_t = n,\ f_{t,k}(a_t) = 1, \\ r, & a_t = n,\ f_{t,k}(a_t) = 0, \\ T(\pi^s_{t,k}(n)), & a_t \ne n. \end{cases} \tag{4}$$
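
The belief recursion in (3) and (4) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`T`, `T_L`, `update_belief`) and the values $p = 0.8$, $r = 0.3$ are assumptions introduced here, not part of the paper. The sketch also uses the standard fact that, for a two-state chain, repeated application of $T$ converges to the stationary probability $r / (1 - p + r)$.

```python
from typing import Optional


def T(gamma: float, p: float, r: float) -> float:
    """One-step belief evolution: T(gamma) = gamma * p + (1 - gamma) * r."""
    return gamma * p + (1.0 - gamma) * r


def T_L(gamma: float, p: float, r: float, L: int) -> float:
    """L-step belief evolution, obtained by applying T repeatedly as in (3)."""
    for _ in range(L):
        gamma = T(gamma, p, r)
    return gamma


def update_belief(belief: float, scheduled: bool, feedback: Optional[int],
                  p: float, r: float) -> float:
    """Belief update in the spirit of (4) for a single channel.

    A scheduled channel returns feedback 1 (good) or 0 (bad); an
    unscheduled channel evolves through the Markov dynamics only.
    """
    if scheduled:
        if feedback not in (0, 1):
            raise ValueError("a scheduled channel needs feedback 0 or 1")
        return p if feedback == 1 else r
    return T(belief, p, r)


# Hypothetical transition probabilities with p > r (positive correlation).
P_GOOD, R_GOOD = 0.8, 0.3

after_good = update_belief(0.5, True, 1, P_GOOD, R_GOOD)
after_bad = update_belief(0.5, True, 0, P_GOOD, R_GOOD)
unobserved = update_belief(0.5, False, None, P_GOOD, R_GOOD)
long_run = T_L(0.9, P_GOOD, R_GOOD, 50)
print(after_good, after_bad, unobserved, long_run)
```

Note that with $p > r$ the operator $T$ is increasing in its argument, so an ordering between two beliefs is preserved as both evolve unobserved.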

Action space: This refers to the set of channels from which the scheduling decision is made. The spectrum server selects
channels only from those that are currently idle,³ and the action space $A_t$ in control slot $t$ can thus be written as:

$$A_t = \{n : o_{t,1}(n) = 0\}. \tag{5}$$

3 This is a policy level constraint to protect the PU's priority in spectrum access.
State: At the beginning of each control slot, the spectrum server makes the scheduling decision based on three factors
for each of the PU channels: 1) the idle/busy state at the moment; 2) the length of time the channel has been in the current
occupancy state (i.e., age); and 3) the fading state belief value. That is, the state of each PU channel $n$ is represented by a
three-dimensional vector: $S_t(n) = [o_{t,1}(n), x_t(n), \pi^s_{t,1}(n)]$. Accordingly, the state of the system at the beginning of the current
control slot $t$ is described by an $N \times 3$ matrix

$$S_t := [S_t(1); \ldots; S_t(N)] = \begin{bmatrix} o_{t,1}(1) & x_t(1) & \pi^s_{t,1}(1) \\ o_{t,1}(2) & x_t(2) & \pi^s_{t,1}(2) \\ \vdots & \vdots & \vdots \\ o_{t,1}(N) & x_t(N) & \pi^s_{t,1}(N) \end{bmatrix}. \tag{6}$$



Horizon: The horizon is the number of consecutive control slots over which scheduling is performed. We index the control
slots in decreasing order, with control slot 1 being the end of the horizon.⁴ Throughout the paper, we denote the length of
the horizon by $m$, i.e., the scheduling process begins at control slot $m$.

Stationary scheduling policy: A stationary scheduling policy $\mathcal{V}$ establishes a stationary mapping from the current state $S_t$
to an action $a_t$ in each control slot $t$.

Expected immediate reward: The expected immediate reward is the reward accrued by the SU within the current control
slot. Specifically, the SU collects one unit of reward in each mini-slot, if the channel is idle and has good channel fading (i.e.,
conditions that indicate successful transmission by the SU). Since the scheduled channel must be idle in the first mini-slot of the
current control slot, the expected immediate reward can be calculated as:

$$R_t(S_t, a_t) = \sum_{k=2}^{K} \pi^o_{t,k}(a_t)\, \pi^s_{t,k}(a_t) + \pi^s_{t,1}(a_t). \tag{7}$$

Total discounted reward: Given a scheduling policy $\mathcal{V}$, the total discounted reward, accumulated from the current control
slot $t$ until the horizon, can be written as⁵

$$V_t(S_t; \mathcal{V}) = R_t(S_t, a_t) + \beta\, E_{\pi^s_{t-1}} E_{O_{t-1}} V_{t-1}(S_{t-1}; \mathcal{V}), \tag{8}$$

where $\beta \in (0, 1)$ is the discount factor, facilitating relative weighing between the immediate and future rewards, and the
expectation is taken with respect to the fading state belief $\pi^s_{t-1} = \{\pi^s_{t-1,1}(1), \ldots, \pi^s_{t-1,1}(N)\}$ and the PU occupancy
$O_{t-1} = \{o_{t-1,1}(1), \ldots, o_{t-1,1}(N)\}$.

Objective function: The objective of the scheduling problem is to maximize the SU's throughput, i.e., the SU's total discounted
reward. A scheduling policy $\mathcal{V}^*$ is optimal if and only if the following optimality equation is satisfied:

$$V_t(S_t; \mathcal{V}^*) = \max_{a_t \in A_t} \left\{ R_t(S_t, a_t) + \beta\, E_{\pi^s_{t-1}} E_{O_{t-1}} V_{t-1}(S_{t-1}; \mathcal{V}^*) \right\}. \tag{9}$$

The function $V_t^*(S_t) := V_t(S_t; \mathcal{V}^*)$ is the objective function of the scheduling problem.

III. Fundamental Tradeoffs 

The decision on opportunistic spectrum scheduling is made based on two sets of system states: the PU occupancy on the 
channel and the channel fading perceived by the SU. On one hand, PUs may return in the middle of a control slot and hinder 
further transmissions of the SU, leading to a decreased reward for the SU. The temporal memory resident in the PU occupancy 
suggests that the past history of channel's occupancy, measured by the age, influences the occupancy state of the channel in 
the future. On the other hand, the PU channels may suffer from "bad" channel fading in the middle of a control slot, even if a 
PU does not return to hinder the SU's transmissions. Similar to the PU occupancy, the historic observation on the fading process
would help determine the expected channel fading in the future. Note that by way of the channel feedback arrangement, an 
observation of a PU channel fading is made only when that channel is scheduled to the SU. Thus scheduling is inherently tied 
to channel fading learning. Roughly speaking, to maximize the SU's reward, the spectrum server must schedule a channel such 
that the combination of the perceived channel occupancy and channel fading strikes a "perfect" balance between the immediate 
gains and channel learning for future gains. We discuss this intricate tradeoff in the following. 

A. Classic "Exploitation vs. Exploration" Tradeoff 

In the existing literature (e.g., [7], [8], [10]-[12]), the focus has been on considering only one of the factors: either channel
fading or channel occupancy, along with the associated temporal correlation. The optimal decision is a mapping that best
balances the tradeoff of "exploitation" and "exploration" on the single factor being considered. The exploitation side lets
the scheduler choose the channel with the best perceived channel fading (or occupancy state) at the moment, corresponding
to immediate gains; while the exploration side tends to favor the channel with the least learnt information so far, probing
which can contribute to the overall understanding of the channel fading (or occupancy state) in the network, and thus better
opportunistic scheduling decisions in the future.

4 For the mini-slots, we use the conventional increasing time indexing.

5 In the subsequent sections, we may drop $\mathcal{V}$ and the tiers of expectation to simplify the notation.

B. "Exploitation vs. Exploration" Tradeoff in Dynamics of Both Channel Fading and PU Occupancy States 

In contrast to the existing works, we examine the tradeoffs when the temporal correlation in both the channel fading and 
PU occupancy are considered. While the classic tradeoff described above apparently exists, additional tradeoffs arise in our 
context due to the interactions between the two sets of system states. In particular, the long temporal memory in the PU 
occupancy state adds a new layer to the tradeoffs inherent in the problem. For instance, note that the SU can only transmit 
on idle channels. To carry out exploration, the channel being favored in the traditional sense, i.e., the one on which the least 
information is available, may no longer be the preferred choice if this channel is perceived to be unavailable (i.e., busy) for a 
prolonged duration in the near future. In other words, it may not be worth learning the channel as the SU cannot utilize the 
learned knowledge in the near future. 

Fig. 2 is a pictorial illustration of the impact of the occupancy state on the SU's expected reward. The history of occupancy,
represented by the idle ages $x_t(1)$ and $x_t(2)$, affects both the immediate and future rewards of the SU. Specifically, as the
idle age increases, the temporal memory in the PU's occupancy pushes the channel to transition to busy sooner (i.e., time point $b$
comes earlier than $a$ in the figure). Therefore, the average availability on the PU channel in the current control slot decreases,
which leads to a smaller immediate reward for the SU.

Further, note that the latest mini-slot for which the spectrum server receives channel fading feedback is also the last mini-slot
before the PU returns, i.e., time points $a$ and $b$, respectively, for the two cases in Fig. 2. The duration $d_1$ (likewise, $d_2$) in
Fig. 2 is an indication of how "fresh" the channel fading information is for the scheduling decision at the beginning of the
next control slot, i.e., $t - 1$. With $d_1 < d_2$, channel feedback is fresher in the former case, with a lower idle age $x_t(1)$.
Thus, age, through its effect on the freshness of feedback, and the availability of the PU channels in the future slots, adds 
another layer to the tradeoffs, thereby influencing the optimal scheduling decision. 



[Figure 2: timelines over control slots $t+1$, $t$, and $t-1$ for two idle ages, with $x_t(2) > x_t(1)$ and $b < a$.]
Fig. 2. An illustration of the impact of age on the SU's reward.



To better perceive the intricate tradeoffs in the system, we proceed, in what follows, with a number of break-down results 
that aim at illustrating each tier of the tradeoffs progressively. 

C. Tradeoffs Inherent in Immediate Reward 

Consider the PU channel scheduled in the current control slot $t$. Let $k_0$ denote the latest mini-slot before the PU of the
scheduled channel returns in the current control slot. Clearly, $k_0$ is a random variable taking values in $\{1, \ldots, K\}$. Let
$p_z = \Pr(k_0 = z)$. With $x$ denoting the idle age of the scheduled channel, $k_0$ is distributed as follows. For $K = 2$:

$$p_1 = 1 - P_I(x + 1), \qquad p_2 = P_I(x + 1),$$

and for $K \ge 3$:

$$\begin{aligned} p_1 &= 1 - P_I(x + 1), \\ p_z &= \prod_{j=1}^{z-1} P_I(x + j)\, \bigl(1 - P_I(x + z)\bigr), \quad z = 2, \ldots, K - 1, \\ p_K &= \prod_{j=1}^{K-1} P_I(x + j). \end{aligned} \tag{10}$$
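
The distribution in (10) is straightforward to tabulate. The sketch below is a minimal illustration, assuming the idle-stay probability $P_I(x) = 1/(x^u + C_I)$ from (1); the function name `k0_distribution` and the parameter values are hypothetical.

```python
from typing import List


def k0_distribution(x: int, K: int, u: int, c_idle: float) -> List[float]:
    """Return [p_1, ..., p_K] with p_z = Pr(k_0 = z), following (10).

    x is the idle age of the scheduled channel at the start of the
    control slot, and P_I(y) = 1 / (y**u + c_idle) as in (1).
    """
    def p_idle(age: int) -> float:
        return 1.0 / (age ** u + c_idle)

    probs = []
    survive = 1.0  # probability that the channel is still idle through mini-slot z
    for z in range(1, K):
        # The PU returns right after mini-slot z, so z is the last usable one.
        probs.append(survive * (1.0 - p_idle(x + z)))
        survive *= p_idle(x + z)
    probs.append(survive)  # the PU does not return within the control slot
    return probs


# Hypothetical parameters: K = 4 mini-slots, u = 1, C_I = 0.5.
dist_age_1 = k0_distribution(x=1, K=4, u=1, c_idle=0.5)
dist_age_2 = k0_distribution(x=2, K=4, u=1, c_idle=0.5)
print([round(q, 4) for q in dist_age_1])
print([round(q, 4) for q in dist_age_2])
```

Comparing the two printed rows gives a quick sanity check of how the probability mass shifts toward $z = 1$ as the idle age grows.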
In the following lemma, we establish the structure of the distribution of $k_0$.

Lemma 1: The probability $p_z = \Pr(k_0 = z)$ is monotonically decreasing in the idle age $x_t$ for $z = 2, \ldots, K$, and monotonically
increasing in $x_t$ for $z = 1$.

Proof: For $K = 2$, it is straightforward to establish the conclusion. We next focus on $K \ge 3$. First, it is easy to show that
for $z = 1$, $p_1$ is monotonically increasing in $x$ since $P_I(x)$ is monotonically decreasing. Further, note that if two positive-valued
functions, $f_1(x) > 0$ and $f_2(x) > 0$, are both monotonically decreasing in $x$, i.e., for any positive integer $\Delta \ge 1$,

$$f_1(x) \ge f_1(x + \Delta), \qquad f_2(x) \ge f_2(x + \Delta),$$

then the product of the two, $f_1(x) f_2(x)$, is also a decreasing function since

$$f_1(x) f_2(x) \ge f_1(x + \Delta) f_2(x + \Delta).$$

Therefore, $p_K$, being a product of such functions, is monotonically decreasing in $x$.

Next, we show that $p_z$, $z = 2, \ldots, K - 1$, are monotonically decreasing in $x$. Based on the above argument, it is sufficient
to show that for any $z = 2, \ldots, K - 1$, the following function,

$$g(x) \triangleq P_I(x + z - 1)\bigl(1 - P_I(x + z)\bigr),$$

is monotonically decreasing. For a positive integer $\Delta \ge 1$, we have the following simplification:

$$\begin{aligned} \frac{g(x)}{g(x+\Delta)} &= \frac{P_I(x+z-1)\bigl(1 - P_I(x+z)\bigr)}{P_I(x+\Delta+z-1)\bigl(1 - P_I(x+\Delta+z)\bigr)} \\ &= \frac{(x+\Delta+z-1)^u + C_I}{(x+z)^u + C_I} \cdot \frac{(x+\Delta+z)^u + C_I}{(x+\Delta+z)^u + C_I - 1} \cdot \frac{(x+z)^u + C_I - 1}{(x+z-1)^u + C_I} \\ &= B_1 B_2 B_3. \end{aligned} \tag{11}$$

Since $x > 0$, $C_I > 0$, $u \ge 1$, and $\Delta \ge 1$, we immediately have $B_1 \ge 1$ and $B_2 > 1$. Further, using the binomial theorem, we
obtain

$$(x+z)^u = \sum_{i=0}^{u} \binom{u}{i} (x+z-1)^i \ge (x+z-1)^u + 1, \tag{12}$$

and hence $B_3 \ge 1$. As a result, $g(x)/g(x+\Delta) > 1$, and $p_2, \ldots, p_{K-1}$ are monotonically decreasing in $x$. This concludes the proof. ■

We next present a result that demonstrates the tradeoff inherent in the immediate reward with respect to age. 
Proposition 1: The immediate reward on the scheduled channel is monotonically decreasing in the idle age. 
Proof: As the system model implies, we can rewrite the immediate reward as the following weighted sum: 

$$R_t(S_t, a_t) = \sum_{z=1}^{K} \sum_{k=1}^{z} \pi^s_{t,k}(a_t)\, p_z, \tag{13}$$

where $p_z = \Pr(k_0 = z)$ is given by (10).

When $K = 2$, with the idle age on $a_t$ being $x_t$, we have $R_t(S_t, a_t) = \pi^s_{t,1}(a_t) + P_I(x_t + 1)\, \pi^s_{t,2}(a_t)$. Clearly, it increases
as $x_t$ decreases.

For $K \ge 3$, denote $\theta_z = \sum_{k=1}^{z} \pi^s_{t,k}(a_t)$, $z = 1, \ldots, K$. It is clear that

$$\theta_1 \le \theta_2 \le \cdots \le \theta_K, \tag{14}$$

and the $\theta_z$'s are constant in the idle age $x_t$. To emphasize the role of the argument $x_t$, we rewrite $R_t(S_t, a_t)$ as $R_t(x_t)$, and
$p_z$ as $p_z(x_t)$.

Utilizing the result of Lemma 1 and noting that, for any positive integer $\Delta \ge 1$, $|p_1(x_t) - p_1(x_t + \Delta)| = \sum_{z=2}^{K} \bigl(p_z(x_t) - p_z(x_t + \Delta)\bigr)$
(because the $p_z$'s sum to one), we obtain:

$$R_t(x_t) - R_t(x_t + \Delta) = \sum_{z=1}^{K} \theta_z \bigl(p_z(x_t) - p_z(x_t + \Delta)\bigr) \ge \theta_1 \bigl(p_1(x_t) - p_1(x_t + \Delta)\bigr) + \theta_2 \sum_{z=2}^{K} \bigl(p_z(x_t) - p_z(x_t + \Delta)\bigr) \ge 0,$$

i.e., $R_t(x_t)$ is monotonically decreasing in the idle age $x_t$, and this establishes the proposition. ■
The above result can be readily extended to the following corollary. 

Corollary 1: When all the PU channels have equal fading state beliefs, the immediate reward is maximized by scheduling 
the SU to the channel with the lowest idle age. 
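
Proposition 1 and Corollary 1 can be checked numerically with a small sketch. Two assumptions are made here and are not stated in this form in the text: the expected reward is evaluated by swapping the order of summation in (13), i.e., as the sum over mini-slots $k$ of $\Pr(k_0 \ge k)$ times the fading belief in mini-slot $k$, and the within-slot beliefs are propagated as $T^{k-1}$ applied to the initial belief. The function name `immediate_reward` and all parameter values are hypothetical.

```python
def immediate_reward(x: int, belief: float, K: int,
                     p: float, r: float, u: int, c_idle: float) -> float:
    """Expected immediate reward of a scheduled channel with idle age x.

    Equivalent to (13) after swapping the order of summation:
    sum over k of Pr(k_0 >= k) * (fading belief in mini-slot k).
    """
    reward = 0.0
    survive = 1.0  # Pr(k_0 >= k): the SU is still transmitting in mini-slot k
    b = belief     # prior probability of good fading in mini-slot k
    for k in range(1, K + 1):
        reward += survive * b
        survive *= 1.0 / ((x + k) ** u + c_idle)  # P_I(x + k) from (1)
        b = b * p + (1.0 - b) * r                 # one application of T
    return reward


# Hypothetical parameters for illustration.
params = dict(K=4, p=0.8, r=0.3, u=1, c_idle=0.5)

# Proposition 1: the reward shrinks as the idle age grows.
rewards_by_age = [immediate_reward(x, 0.7, **params) for x in range(1, 6)]
print([round(v, 4) for v in rewards_by_age])

# Corollary 1: with equal beliefs, the lowest idle age wins.
idle_ages = [5, 2, 9]
best = max(range(len(idle_ages)),
           key=lambda n: immediate_reward(idle_ages[n], 0.7, **params))
print("best channel index:", best)
```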

Next, recall that the channel fading is modeled by a positively correlated Markov chain. Hence, if $\pi^s_{t,1}(a_t) \ge \pi^s_{t,1}(a'_t)$ for two candidate channels $a_t$ and $a'_t$, then
the inequality $\pi^s_{t,k}(a_t) \ge \pi^s_{t,k}(a'_t)$ holds for all $k = 2, \ldots, K$. We present the following proposition without further proof.






Proposition 2: The immediate reward on the scheduled channel is monotonically increasing in its fading state belief at the 
moment. Further, given equal idle ages across all PU channels, the immediate reward is maximized by scheduling the SU to 
the channel with the largest fading state belief value at the moment. 

D. Tradeoffs Inherent in Total Reward 

In this subsection, we illustrate the tradeoffs inherent in the total reward by examining a special case with two channels ($N = 2$)
and a single mini-slot per control slot ($K = 1$). In particular, we show that under these conditions, a simple greedy scheduling policy
is optimal. The greedy policy is formally defined as follows: In any control slot, the greedy decision maximizes the immediate
reward, ignoring the future rewards, i.e.,

$$a_t = \arg\max_{a_t \in A_t} R_t(S_t, a_t). \tag{15}$$
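
A minimal sketch of the greedy rule in (15), restricted to the action space (5), is given below. The names `ChannelState` and `greedy_action`, the 0-based channel indexing, and the convention of returning `None` when every channel is busy are assumptions made for this illustration. For $K = 1$, the sum in (7) is empty, so the immediate reward reduces to the fading state belief, which is what the example reward function uses.

```python
from typing import Callable, Optional, Sequence, Tuple

# (occupancy in the first mini-slot, age, fading state belief) of one channel
ChannelState = Tuple[int, int, float]


def greedy_action(states: Sequence[ChannelState],
                  reward_fn: Callable[[ChannelState], float]) -> Optional[int]:
    """Pick the idle channel with the largest immediate reward, as in (15).

    Returns a 0-based channel index, or None if no channel is idle.
    """
    idle = [n for n, (occupancy, _, _) in enumerate(states) if occupancy == 0]
    if not idle:
        return None
    return max(idle, key=lambda n: reward_fn(states[n]))


def reward_k1(state: ChannelState) -> float:
    """Immediate reward for K = 1: the fading state belief itself."""
    return state[2]


# Hypothetical system state: channel 1 is busy despite its high belief.
system_state = [(0, 3, 0.4), (1, 2, 0.9), (0, 1, 0.6)]
choice = greedy_action(system_state, reward_k1)
print("greedy choice:", choice)
```

Passing the reward as a function keeps the selection rule separate from the reward model, so the same sketch applies for $K > 1$ once a suitable reward function is supplied.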

We now formally record the result on greedy policy optimality in the following proposition. 

Proposition 3: The greedy policy is optimal when K = 1 and N = 2. 

Proof: To prove the proposition, we begin with the following induction hypothesis:

Induction Hypothesis: With the length of the horizon denoted by $m$, $m \ge 2$, assume that the greedy policy is optimal in all the
control slots $t \in \{m - 1, \ldots, 1\}$.

The proof proceeds in two steps: First, we fix a sequence of scheduling decisions $I := \{a_m, \ldots, a_{t+1}\}$, and show that the
expected immediate reward in control slot $t$, under the greedy policy, is independent of the scheduling decisions $I$. Then, we
provide induction-based arguments to validate the induction hypothesis and hence establish that the greedy policy is optimal
in all the control slots.

Let $U_t^{(I)}$ denote the expected immediate reward in slot $t \in \{m - 1, \ldots, 1\}$, given the scheduling decisions $I$. $U_t^{(I)}$ can be
calculated as:

$$U_t^{(I)} = \sum_{\{o_t(1),\, o_t(2)\}} U_t^{(I)}\bigl(o_t(1), o_t(2)\bigr)\, \Pr\bigl(o_t(1), o_t(2)\bigr), \tag{16}$$

where $o_t(1)$ (likewise, $o_t(2)$) is the binary indicator of whether channel 1 (or 2) is idle ($o_t(1) = 0$) or not ($o_t(1) = 1$) in the $t$th
control slot, and $\Pr(o_t(1), o_t(2))$ denotes the joint probability of both channels' availability status (idle or busy). There exist
four realizations of the vector $(o_t(1), o_t(2))$, namely $\{(0, 0), (0, 1), (1, 0), (1, 1)\}$. In what follows, we show that the value of
$U_t^{(I)}(o_t(1), o_t(2))$ calculated under each of the realizations is independent of the scheduling decisions $I$.

Case 1: When $(o_t(1), o_t(2)) = (0, 0)$. In this case, both channels are idle in control slot $t$. Let $\pi^s_{t+1}(n)$ be the fading state
belief on channel $n$ in control slot $t + 1$. The expected immediate reward in control slot $t$, under the greedy policy, can be
calculated as

$$U_t^{(I)}(0, 0) = \pi^s_{t+1}(a_{t+1})\, p + \bigl(1 - \pi^s_{t+1}(a_{t+1})\bigr)\, T\bigl(\pi^s_{t+1}(\bar{a}_{t+1})\bigr), \tag{17}$$

where $\bar{a}_{t+1} = \{1, 2\} \setminus a_{t+1}$. For notational convenience, we write $\alpha = \pi^s_{t+1}(a_{t+1})$ and $\bar{\alpha} = \pi^s_{t+1}(\bar{a}_{t+1})$. The reward $U_t^{(I)}(0, 0)$
can now be further expressed as

$$\begin{aligned} U_t^{(I)}(0, 0) &= p\alpha + (1 - \alpha)\, T(\bar{\alpha}) \\ &= p\alpha + (1 - \alpha)\bigl(\bar{\alpha} p + (1 - \bar{\alpha}) r\bigr) \\ &= p P_1 + r P_2, \end{aligned} \tag{18}$$

where

$$P_1 = \alpha + \bar{\alpha} - \alpha\bar{\alpha}, \qquad P_2 = (1 - \alpha)(1 - \bar{\alpha}). \tag{19}$$

That is, $P_1$ is the probability that at least one of the channels experiences good channel fading in the previous control slot $t + 1$,
while $P_2$ is the probability that both channels see bad channel fading. It is noted that these probabilities are controlled by the
underlying Markov dynamics only, and thus $P_1$ and $P_2$ are independent of the scheduling decisions $I$. Therefore, $U_t^{(I)}(0, 0)$ is
independent of $I$.
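
The algebra behind (18) and (19) can be verified numerically. The sketch below, with hypothetical function names and parameter values, checks that the two forms of the Case 1 reward agree and that the result is unchanged when the roles of the scheduled and unscheduled channels in slot $t + 1$ are swapped, which is the symmetry the argument relies on.

```python
import math


def T(gamma: float, p: float, r: float) -> float:
    """One-step belief evolution T(gamma) = gamma * p + (1 - gamma) * r."""
    return gamma * p + (1.0 - gamma) * r


def case1_reward(alpha: float, alpha_bar: float, p: float, r: float) -> float:
    """First line of (18): p * alpha + (1 - alpha) * T(alpha_bar)."""
    return p * alpha + (1.0 - alpha) * T(alpha_bar, p, r)


def case1_symmetric(alpha: float, alpha_bar: float, p: float, r: float) -> float:
    """Last line of (18): p * P_1 + r * P_2, with P_1 and P_2 from (19)."""
    p1 = alpha + alpha_bar - alpha * alpha_bar   # at least one channel good
    p2 = (1.0 - alpha) * (1.0 - alpha_bar)       # both channels bad
    return p * p1 + r * p2


# Hypothetical beliefs and transition probabilities.
ALPHA, ALPHA_BAR, P_GOOD, R_GOOD = 0.35, 0.75, 0.8, 0.3

direct = case1_reward(ALPHA, ALPHA_BAR, P_GOOD, R_GOOD)
symmetric = case1_symmetric(ALPHA, ALPHA_BAR, P_GOOD, R_GOOD)
swapped = case1_reward(ALPHA_BAR, ALPHA, P_GOOD, R_GOOD)
print(direct, symmetric, swapped)
```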

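Case 2: When $(o_t(1), o_t(2)) = (0, 1)$. In this case, only channel 1 is idle and can be scheduled. The reward $U_t^{(I)}(0, 1)$ is
obtained as

$$U_t^{(I)}(0, 1) = \begin{cases} p\, \pi^s_{t+1}(1) + r \bigl(1 - \pi^s_{t+1}(1)\bigr), & \text{if } a_{t+1} = 1, \\ T\bigl(\pi^s_{t+1}(1)\bigr), & \text{if } a_{t+1} = 2. \end{cases} \tag{20}$$

Since $T(\gamma) = \gamma p + (1 - \gamma) r$, it follows from (20) that

$$U_t^{(I)}(0, 1)\big|_{a_{t+1} = 1} = U_t^{(I)}(0, 1)\big|_{a_{t+1} = 2}, \tag{21}$$

i.e., $U_t^{(I)}(0, 1)$ is independent of $I$.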
Case 3: When $(o_t(1), o_t(2)) = (1, 0)$. Similar to Case 2, only channel 2 can be scheduled in this case, and we have:

$$U_t^{(I)}(1, 0) = \begin{cases} T\bigl(\pi^s_{t+1}(2)\bigr), & \text{if } a_{t+1} = 1, \\ p\, \pi^s_{t+1}(2) + r \bigl(1 - \pi^s_{t+1}(2)\bigr), & \text{if } a_{t+1} = 2. \end{cases} \tag{22}$$

Again, this indicates that

$$U_t^{(I)}(1, 0)\big|_{a_{t+1} = 1} = U_t^{(I)}(1, 0)\big|_{a_{t+1} = 2}, \tag{23}$$

i.e., $U_t^{(I)}(1, 0)$ is independent of $I$.

Case 4: When $(o_t(1), o_t(2)) = (1, 1)$. In this case, both channels are busy, and it follows immediately that

$$U_t^{(I)}(1, 1) = 0. \tag{24}$$

Clearly, $U_t^{(I)}(1, 1)$ is independent of $I$ as well.

Next, note that $\Pr(o_t(1), o_t(2))$ is a function of the ages $(x_t(1), x_t(2))$ only, which evolve independently of the scheduling
decisions $I$. Thereby, we conclude that $\Pr(o_t(1), o_t(2))$ is independent of the scheduling decisions $I$, and so is the expected
immediate reward in control slot $t$, i.e., $U_t^{(I)} = U_t$. Now, by extension, we have that the total reward collected from control slot
$t$ till the horizon is independent of $I$, i.e.,

$$\sum_{t'=t}^{1} \beta^{t - t'} U_{t'}^{(I)} = \sum_{t'=t}^{1} \beta^{t - t'} U_{t'}.$$

Thus, the greedy policy is optimal in control slot $t + 1$ as well. Since $t \in \{m - 1, \ldots, 1\}$ is arbitrary, the greedy policy is optimal
in every control slot $\{m, \ldots, 1\}$ under the induction hypothesis. Finally, as the greedy policy is trivially optimal at the horizon,
i.e., $t = 1$, using backward induction, we validate the induction hypothesis, and arrive at the conclusion that greedy is optimal
in all control slots $t \in \{m, \ldots, 1\}$. This establishes the proposition. ■
Remarks: Note that the tradeoffs inherent in the special case considered above, i.e., K = 1, N = 2, are more intricate than
those observed in related recent works (e.g., [8], [18]), where a control slot coincides with a mini-slot and only one of the
states: channel fading or PU occupancy, is considered. This is because, despite K = 1, the question of "whether to learn 
a channel that may not be available for scheduling in the near future due to channel occupancy state" still exists. Thus the 
tradeoffs discussed in the preceding subsections are retained in this special case, essentially adding value to our result on greedy 
optimality. In the subsequent section, we proceed to further understand the tradeoffs in the original system by introducing a 
conceptual "genie-aided system." 

IV. Multi-tier Tradeoffs: A Closer Look via A Genie-Aided System 

In the previous section, we partially showed the interaction between various state elements by examining the immediate 
reward and certain special cases. In order to obtain a more complete understanding of the inherent dynamics, we next introduce 
a full-observation genie-aided system that helps decompose and characterize the various tiers of the multi-dimensional tradeoffs. 

A. A Genie-Aided System 

The genie-aided system is a variant of the original system with the following modification: the spectrum server receives
channel fading feedback from all the channels, not only from the scheduled channel. This feedback is collected at the same
times as the feedback from the scheduled channel. Thus, when the PU returns on the scheduled channel, the feedback
from all the channels stops at once. Note that this is a conceptual system without practical significance, which, as we will see,
is helpful in better understanding the complicated tradeoffs inherent in the original system.

B. Tradeoffs Associated with Channel Fading 

Proposition 4: When the idle ages are the same across all PU channels, it is optimal to schedule the SU to the channel with 
the highest fading state belief at the moment, i.e., 

$a_t^* = \arg\max_{n} \{\pi^s_{t,1}(n)\}$. (25)

Proof: First, from Proposition 2, the immediate reward is maximized when scheduling the channel with the highest fading
state belief at the moment. Now, we focus on showing that the future reward is independent of the action in the current control
slot, thereby establishing the proposition. Specifically, at the current control slot $t$, $t \geq 2$, consider an arbitrary control slot
in the future, $t_0 \in \{t-1, \ldots, 1\}$. In the following, we show that the expected immediate reward in this control slot, denoted
by $U_{t_0}$, is independent of the current action $a_t$, and thus the future reward is independent of $a_t$.

Let $k'_0$ denote the latest idle mini-slot in control slot $t_0 + 1$ before the PU returns. The reward $U_{t_0}$ is then calculated as:

$U_{t_0} = \sum_{k'_0=1}^{K} U_{t_0}(k'_0) \Pr(k'_0)$, (26)

where $\Pr(k'_0)$ is the distribution of $k'_0$, identical to that of $k_0$ as given in (10), and $U_{t_0}(k'_0)$ is the expected reward in control
slot $t_0$ for a given $k'_0$.

In the genie-aided system, the spectrum server obtains the feedback of the fading states from all $N$ channels simultaneously,
i.e., at the end of mini-slot $k'_0$. Place the channels on which good channel fading is observed at $k'_0$ in the set $C_1$, and the rest
in another set $C_0$. The characterization of the reward $U_{t_0}(k'_0)$ can be further divided into the following cases.

• Case 1: $C_0 = \emptyset$. This corresponds to the case where $f_{t_0+1,k'_0}(n) = 1$, $\forall n = 1, \ldots, N$. Let $u_{t_0}^{(\mathrm{case1})}(n)$ be the expected
reward in control slot $t_0$ on channel $n$ in this case. We have

$u_{t_0}^{(\mathrm{case1})}(n) = \pi^o_{t_0,1}(n) \sum_{z=1}^{K} \sum_{k=1}^{z} \pi^s_{t_0,k}(n)\, p_z(x_{t_0}(n))$, (27)

where $\pi^s_{t_0,1}(n) = T^{K-k'_0}(p)$, $\forall n \in C_1$. It follows that $\pi^s_{t_0,k}(1) = \ldots = \pi^s_{t_0,k}(N)$, $\forall k = 1, \ldots, K$. Further, since $x_{t_0}(1) =
\ldots = x_{t_0}(N)$, we have $p_z(x_{t_0}(1)) = \ldots = p_z(x_{t_0}(N))$, and $\pi^o_{t_0,1}(1) = \ldots = \pi^o_{t_0,1}(N)$, and therefore, scheduling any of
the idle channels in $C_1$ achieves the same expected reward in control slot $t_0$, i.e.,

$U_{t_0}(k'_0)\big|_{\mathrm{case1}} = u_{t_0}^{(\mathrm{case1})}(n)$, $\forall n \in C_1$, $o_{t_0,1}(n) = 0$. (28)

Now, note that given equal idle ages on all $N$ channels, the distribution of $k'_0$ is identical across channels. Therefore, for
all $n = 1, \ldots, N$, the values of $\pi^s_{t_0,k}(n)$, $k = 1, \ldots, K$, $\pi^o_{t_0,1}(n)$ and $p_z(x_{t_0}(n))$ all stay unchanged if a different channel
$a'_t \neq a_t$ is scheduled in the current control slot $t$. This implies that $U_{t_0}(k'_0)\big|_{\mathrm{case1}}$ is unchanged, and is thus independent
of $a_t$.

• Case 2: $C_1 = \emptyset$. In this case, $f_{t_0+1,k'_0}(n) = 0$, $\forall n = 1, \ldots, N$. The expected reward collected in control slot $t_0$ on channel
$n$, denoted by $u_{t_0}^{(\mathrm{case2})}(n)$, can be expressed in the same form as (27), where the channel strength belief $\pi^s_{t_0,1}(n) = T^{K-k'_0}(r)$,
for all $n \in C_0$. Then, using similar reasoning as in Case 1, we obtain that scheduling any of the idle channels in $C_0$
achieves the same expected reward in control slot $t_0$, i.e.,

$U_{t_0}(k'_0)\big|_{\mathrm{case2}} = u_{t_0}^{(\mathrm{case2})}(n)$, $\forall n \in C_0$, $o_{t_0,1}(n) = 0$. (29)

Further, the expected reward achieved in this case, $U_{t_0}(k'_0)\big|_{\mathrm{case2}}$, does not change when the action in the current control slot
$t$ varies, i.e., $U_{t_0}(k'_0)\big|_{\mathrm{case2}}$ is independent of $a_t$.
• Case 3: $C_0 \neq \emptyset$, $C_1 \neq \emptyset$. In this case, we first show that the maximum expected reward in control slot $t_0$ is achieved by
scheduling any of the idle channels in the set $C_1$, which are perceived to have a better channel fading state in the subsequent
control slot $t_0$ than the channels in set $C_0$. Specifically, picking one channel from each set, $n_1 \in C_1$
and $n_0 \in C_0$, we have

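$u_{t_0}^{(\mathrm{case3})}(n_1) = \pi^o_{t_0,1}(n_1) \sum_{z=1}^{K} \sum_{k=1}^{z} \pi^s_{t_0,k}(n_1)\, p_z(x_{t_0}(n_1))$,

$u_{t_0}^{(\mathrm{case3})}(n_0) = \pi^o_{t_0,1}(n_0) \sum_{z=1}^{K} \sum_{k=1}^{z} \pi^s_{t_0,k}(n_0)\, p_z(x_{t_0}(n_0))$, (30)

where

$\pi^s_{t_0,k}(n_1) = T^{K-k'_0+k-1}(p)$,

$\pi^s_{t_0,k}(n_0) = T^{K-k'_0+k-1}(r)$. (31)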

Based on (3) and the property of the positively correlated Markov chain, the following inequality holds: for all $k =
1, \ldots, K$,

$\pi^s_{t_0,k}(n_1) \geq \pi^s_{t_0,k}(n_0)$,

with equality achieved when $K \to \infty$. Further, applying a similar argument as before, we obtain

$u_{t_0}^{(\mathrm{case3})}(n_1) \geq u_{t_0}^{(\mathrm{case3})}(n_0)$.

Now, since $n_1 \in C_1$ (likewise, $n_0 \in C_0$) is arbitrary, from the conclusion drawn in Case 1, the maximal expected reward
in control slot $t_0$ under this case is achieved by scheduling the SU to any of the idle channels in the set $C_1$, i.e.,

$U_{t_0}(k'_0)\big|_{\mathrm{case3}} = u_{t_0}^{(\mathrm{case3})}(n_1)$, $\forall n_1 \in C_1$, $o_{t_0,1}(n_1) = 0$. (32)

Next, based on similar reasoning as in the previous cases, we know that $U_{t_0}(k'_0)\big|_{\mathrm{case3}}$ is independent of $a_t$.
Finally, note that $U_{t_0}(k'_0)$ can be written as:

$U_{t_0}(k'_0) = \sum_{i=1}^{3} U_{t_0}(k'_0)\big|_{\mathrm{case}\,i} \Pr(\mathrm{Case}\ i)$, (33)

where

$\Pr(\mathrm{Case}\ 1) = \prod_{n \in C_1} \pi^s_{t_0+1,k'_0}(n)$,

$\Pr(\mathrm{Case}\ 2) = \prod_{n \in C_0} (1 - \pi^s_{t_0+1,k'_0}(n))$,

$\Pr(\mathrm{Case}\ 3) = \prod_{n \in C_1} \pi^s_{t_0+1,k'_0}(n) \prod_{n' \in C_0} (1 - \pi^s_{t_0+1,k'_0}(n'))$,

are quantities dependent on $k'_0$ only. Based on the fact that the idle ages at the moment are identical across the channels,
the probabilities $\Pr(\mathrm{Case}\ i)$, $i = 1, 2, 3$, and the distribution $\Pr(k'_0)$ are the same across the channels as well. Thus $U_{t_0}$ is
independent of $a_t$.

Since $t_0 \in \{t-1, \ldots, 1\}$ is arbitrary, by extension, we have that the total reward collected from control slot $t-1$ till the
horizon, i.e., $\sum_{t'=t-1}^{1} U_{t'}$, which is the future reward of the current control slot $t$, is independent of $a_t$. This concludes the proof
and establishes the proposition. ■
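
The ordering used in Case 3 can be made concrete with a short worked step. It assumes the one-step belief update $T(\gamma) = \gamma p + (1-\gamma) r$, which is consistent with the closed form for $T^L$ quoted in the proof of Proposition 5 but is not restated in this section:

```latex
% Assumption: T(\gamma) = \gamma p + (1-\gamma) r = (p-r)\gamma + r.
\begin{align*}
T(\gamma_1) - T(\gamma_2) &= (p-r)\,(\gamma_1 - \gamma_2), \\
T^{L}(p) - T^{L}(r) &= (p-r)^{L}\,(p-r) = (p-r)^{L+1} \;\geq\; 0
  \quad \text{for } p \geq r .
\end{align*}
```

With $L = K - k'_0 + k - 1$, the gap between the beliefs on a channel last seen in good fading and one last seen in bad fading shrinks geometrically in $L$ and vanishes only in the limit, which matches the equality condition $K \to \infty$ stated in Case 3.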

Remarks: Proposition 4 illustrates the effect of fading state belief on the optimal decisions in the genie-aided system. With
ages equalized across the PU channels and the classic "exploitation vs. exploration" tradeoff neutralized (by definition of the
genie-aided system), we saw that a higher fading belief favors the immediate reward and that the future reward is, in fact,
independent of the current decision.

In the following, we study the effect of PU occupancy and age on the optimal decisions in the genie-aided system and, in 
turn, its impact on the original system. 

C. Tradeoffs Associated with PU Occupancy 

The following proposition identifies the effect of channel occupancy state on the optimal scheduling decisions. 
Proposition 5: When the fading state beliefs are the same across all PU channels, it is optimal to schedule the SU to the 
channel with the lowest idle age at the moment, i.e., 

$a_t^* = \arg\min_{n} \{x_t(n)\}$. (34)
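
A minimal sketch of the two selection rules in Propositions 4 and 5, assuming beliefs and idle ages are held in plain lists indexed by channel (0-based here, unlike the 1-based channel indices in the text). The function names, the input values, and the tie-breaking rule (lowest index wins) are illustrative assumptions, not part of the paper:

```python
from typing import Sequence


def schedule_equal_ages(fading_beliefs: Sequence[float]) -> int:
    """Rule (25): with equal idle ages, pick the channel with the highest fading belief."""
    return max(range(len(fading_beliefs)), key=lambda n: fading_beliefs[n])


def schedule_equal_beliefs(idle_ages: Sequence[int]) -> int:
    """Rule (34): with equal fading beliefs, pick the channel with the lowest idle age."""
    return min(range(len(idle_ages)), key=lambda n: idle_ages[n])


# Hypothetical inputs, used only to exercise the two rules.
best_by_belief = schedule_equal_ages([0.4, 0.7])   # second channel (index 1)
best_by_age = schedule_equal_beliefs([10, 5])      # second channel (index 1)
```

Each rule applies only under its own precondition (equal ages or equal beliefs); when both quantities differ across channels, neither rule alone is claimed to be optimal.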

Proof: We prove the proposition by showing that scheduling the channel with the lowest idle age favors: 1) the immediate
reward $R_t(S_t, a_t)$; and 2) the optimal future reward $V^*_{t-1}(S_{t-1})$. The first part can be readily shown by appealing to Proposition Q]
and Corollary Q]. To show the second part, we adopt the following induction hypothesis:
Induction Hypothesis: The optimal future reward is convex in the fading state belief.

When channel $a_t$ is scheduled in the current control slot $t$, the expected future reward can be evaluated as:

$V^*_{t-1}(S_{t-1})\big|_{a_t} = \pi^s_{t,k_0}(a_t)\, V^*_{t-1}(p) + (1 - \pi^s_{t,k_0}(a_t))\, V^*_{t-1}(r)$,

where $V^*_{t-1}(p)$ and $V^*_{t-1}(r)$ represent the future reward calculated when the channel fading state observed in the $k_0$-th mini-slot
of control slot $t$ is good or bad, respectively. More specifically,

$V^*_{t-1}(p) = V^*_{t-1}\big(\pi^s_{t-1,1}(a_t) = T^{K-k_0}(p)\big)$,
$V^*_{t-1}(r) = V^*_{t-1}\big(\pi^s_{t-1,1}(a_t) = T^{K-k_0}(r)\big)$.
Based on (3), we have, for $\gamma \in (0, 1)$,

$T^L(\gamma) = (p-r)^L \gamma + \frac{r\,(1 - (p-r)^L)}{1 - (p-r)}$, $L = 0, 1, \ldots$,

and $\lim_{L \to \infty} T^L(\gamma) = \frac{r}{1-p+r} = \pi_{s|s}$, where $\pi_{s|s}$ denotes the steady-state probability of perceiving good channel fading on
any of the PU channels. This indicates that a smaller $k_0$, associated with a higher idle age at the moment (recall discussions
in Section BIB), results in a larger $K - k_0$ and thus a value of $\pi^s_{t-1,1}(a_t)$ closer to $\pi_{s|s}$, in which case,

$V^*_{t-1}(S_{t-1})\big|_{a_t} \to \pi^s_{t,k_0}(a_t)\, V^*_{t-1}(\pi_{s|s}) + (1 - \pi^s_{t,k_0}(a_t))\, V^*_{t-1}(\pi_{s|s}) = V^*_{t-1}(a_t, E_{\pi_{s|s}})$.

On the contrary, as $k_0$ becomes larger because of a lower idle age, the value of $\pi^s_{t-1,1}(a_t)$ deviates further away from $\pi_{s|s}$,
but is closer to $p$ (or $r$). Also, $\pi^s_{t,k_0}(a_t)$ gets closer to $\pi_{s|s}$. Therefore,

$V^*_{t-1}(S_{t-1})\big|_{a_t} \to \pi_{s|s}\, V^*_{t-1}(p) + (1 - \pi_{s|s})\, V^*_{t-1}(r) = E_{\pi_{s|s}} V^*_{t-1}(a_t)$.

Now, appealing to the induction hypothesis and Jensen's inequality [18], we have that $E_{\pi_{s|s}} V^*_{t-1}(a_t) \geq V^*_{t-1}(a_t, E_{\pi_{s|s}})$,
and therefore the future reward is maximized by scheduling the channel with the lowest idle age, i.e.,

$a_t^* = \arg\min_{a_t} \{x_t(a_t)\}$ s.t. $V^*_{t-1}(S_{t-1})\big|_{a_t^*} = \max_{a_t} \{V^*_{t-1}(S_{t-1})\big|_{a_t}\}$.

Finally, we proceed to verify the induction hypothesis. At $t = 2$, the optimal future reward equals the optimal immediate
reward at the horizon $t = 1$, i.e.,

$V^*_1(S_1) = R_1(S_1, a^*_1) := \max_{a_1 \in \mathcal{A}_1} \{R_1(S_1, a_1)\}$. (35)

Since $R_1(S_1, a_1)$ is linear in the strength belief, using the property of convex functions [18], we have that $V^*_1(S_1)$ is convex in
the strength beliefs, which validates the induction hypothesis. Then, using backward induction, we establish the proposition.

■ 
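
The two analytical ingredients of the proof above, the closed form of $T^L$ and the Jensen step, can be checked numerically. The sketch below assumes the one-step update $T(\gamma) = \gamma p + (1-\gamma) r$ (consistent with the closed form, but not restated in this section) and uses a hypothetical convex function `V`, a maximum of two linear functions, as a stand-in for the optimal future reward. The values p = 0.9 and r = 0.1 are those listed for Table II; everything else is illustrative.

```python
def T_iter(gamma: float, p: float, r: float, L: int) -> float:
    """Apply the assumed one-step update T(g) = g*p + (1-g)*r a total of L times."""
    for _ in range(L):
        gamma = gamma * p + (1.0 - gamma) * r
    return gamma


def T_closed(gamma: float, p: float, r: float, L: int) -> float:
    """Closed form: (p-r)^L * gamma + r * (1 - (p-r)^L) / (1 - (p-r))."""
    d = p - r
    return d**L * gamma + r * (1.0 - d**L) / (1.0 - d)


def V(belief: float) -> float:
    """Hypothetical convex stand-in for the future reward (max of two linear functions)."""
    return max(1.0 * belief, 0.6 - 0.2 * belief)


p, r = 0.9, 0.1
pi_ss = r / (1.0 - p + r)  # steady-state probability of good fading

# The closed form should agree with repeated application of T.
max_err = max(abs(T_iter(g, p, r, L) - T_closed(g, p, r, L))
              for g in (0.2, 0.5, 0.9) for L in range(10))

# Fresh feedback (next belief is p or r) versus stale feedback (belief stuck at the mean).
fresh = pi_ss * V(p) + (1.0 - pi_ss) * V(r)
stale = V(pi_ss * p + (1.0 - pi_ss) * r)
```

For any convex `V`, `fresh` is at least `stale`, which is the Jensen step that makes fresher fading feedback (lower idle age) favorable for the future reward.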

Remarks: Proposition 5 explicitly illustrates the effect of the PU occupancy and age on the optimal decisions. With fading
state beliefs equalized and the classic "exploitation vs. exploration" tradeoff neutralized (by definition of the genie-aided
system), we saw that: 1) a lower idle age on the PU channel favors the immediate reward by allowing more idle time on the
channel; and 2) a lower idle age also favors the future reward by way of fresher channel fading feedback.

Thus, by studying the full-observation genie-aided system, via the results in Propositions 4 and 5, we have decomposed
the tradeoffs associated with the channel occupancy and the fading state beliefs in the original system. Indeed, the results in
Propositions 4 and 5 rigorously support the understanding we developed earlier in Section III on the tradeoffs in the original
system.

V. Numerical Results & Further Discussions 

In this section, we evaluate and compare the optimal rewards of the original system (denoted as $V^*_{ori}$) and the genie-aided
system ($V^*_{genie}$). Also, the optimal policy in the original system is compared to a randomized scheduling policy (with reward
denoted as $V^*_{random}$), where the spectrum server chooses a channel uniformly at random among all the idle ones and
allocates it to the SU for data transmission. The numerical results are collected for a two-channel system with $K = 2$,
horizon length $m = 6$, and discount factor $\beta = 0.9$. For notational convenience, denote $\Delta_{ga-ori} = V^*_{genie} - V^*_{ori}$ and
$\Delta_{ori-rnd} = V^*_{ori} - V^*_{random}$.
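
The randomized baseline described above can be sketched as follows. The function name, the list-based occupancy encoding (0 for idle, 1 for busy, as in the occupancy pairs used earlier), and the choice to return `None` when every channel is busy are assumptions made for illustration only:

```python
import random
from typing import Optional, Sequence


def random_schedule(occupancy: Sequence[int], rng: random.Random) -> Optional[int]:
    """Pick uniformly among idle channels (occupancy 0); return None if all are busy."""
    idle = [n for n, o in enumerate(occupancy) if o == 0]
    return rng.choice(idle) if idle else None


rng = random.Random(0)  # fixed seed so the sketch is reproducible
picks = [random_schedule([0, 0], rng) for _ in range(1000)]
share_of_channel_0 = picks.count(0) / len(picks)
```

Because this rule ignores both the fading beliefs and the idle ages, it serves as a memory-free reference point against which the optimal policy is compared.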

Table I records the rewards obtained under various baseline cases, for various values of $\delta = p - r$, which broadly captures
the temporal memory in the channel fading. In particular, as $\delta$ decreases, the channel fading memory fades and the difference
between the baselines, which are primarily differentiated by the degree to which they exploit the memory in the system, tends
to decrease. This is observed from Table I.

TABLE I

Comparison of rewards when the channel fading memory varies. System parameters used:
$u = 1$, $C_I = 1$, $C_B = 2$, $x_t(1) = 10$, $x_t(2) = 5$, $\pi^s_{t,1}(1) = 0.4$, $\pi^s_{t,1}(2) = 0.7$

$\delta$     $\Delta_{ga-ori}$                      $\Delta_{ori-rnd}$
0.8     0.0077    0.3%                  0.372     14.5%
0.4     0.0068    0.29%                 0.2827    9.42%
0.2     0.0023    9.8 x 10^{-4}%        0.1915    8.23%
0.1     0.0006    2.4 x 10^{-4}%        0.1836    7.93%


Fig. 3. Occupancy histories & the conditional idle probability.

Next, in Table II, we compare the baseline rewards under different values of the power exponent $u$, used in the definitions
of $P_I$ and $P_B$ in (1). To build a better understanding of the trend reflected in these numerical results, we consider an arbitrary
mini-slot, denoted as $k'$, as the current mini-slot, and the following two exhaustive sets of PU occupancy histories: 1) the set
of histories $h^I_x$, $x = 1, 2, \ldots$, corresponds to the case when there are exactly $x$ mini-slots between the current mini-slot $k'$ and
the most recent mini-slot (preceding $k'$) when the channel was in busy state, i.e., the idle age of mini-slot $k'$ is $x$; and 2) the
set of histories $h^B_x$, $x = 1, 2, \ldots$, corresponds to the case when there are exactly $x$ mini-slots between the current mini-slot
$k'$ and the most recent mini-slot (preceding $k'$) when the channel was in idle state, i.e., the busy age of mini-slot $k'$ is $x$. In
Fig. 3(a), we plot the two sets of occupancy histories. As has been pointed out in Section II-B, the idle/busy age is a sufficient
statistic for capturing the memory in the PU channels' occupancy states. Thereby, with the construction of $h^I_x$ and $h^B_x$, we can
examine the effect of the temporal correlation of PU occupancy on the system performance. Specifically, the conditional idle
probability in mini-slot $k'$, given occupancy history, can be obtained as:

*°»\hs =1 - Pb ^ = 1 -^Tc^- (36) 

As an example, we plot $\pi^o_{k'|h^I_x}$ in Fig. 3(b). It is clear that as $u$ increases, the conditional probability curves become steeper.
Define the threshold point, $x_0$, such that for all $x > x_0$, the difference in $\pi^o_{k'|h^I_x}$ is insignificant (below $10^{-2}$). Now, note from
the figure that the threshold $x_0$ decreases with increasing $u$. This indicates that the impact of
different occupancy histories on the current PU occupancy state diminishes with increasing $u$, and thus the memory in
the PU occupancy decreases. A similar argument holds when considering $h^B_x$. In short, the exponent $u$ broadly captures the PU occupancy
memory: as its value increases, the memory decreases and thus the rewards under the various cases, as expected, come closer.
This is illustrated in Table II.

TABLE II

Comparison of rewards when the occupancy memory varies. System parameters used:
$p = 0.9$, $r = 0.1$, $C_I = 1$, $C_B = 2$, $x_t(1) = 0$, $x_t(2) = 1$, $\pi^s_{t,1}(1) = 0.4$, $\pi^s_{t,1}(2) = 0.7$

$u$     $\Delta_{ga-ori}$              $\Delta_{ori-rnd}$
1     0.0088    0.35%           0.3273    12.66%
3     0.006     0.26%           0.2201    9.59%
5     0.0043    0.2%            0.1839    8.28%



Finally, as can be seen from both tables, the original system performs very close to the genie-aided system, while the cost of
measuring and sending the channel fading feedback is only $1/N$ of the latter. Also, the optimal policy significantly outperforms
the randomized policy, indicating that the temporal correlation structure in the channel fading and PU occupancy can greatly
benefit the opportunistic spectrum scheduling, when appropriately exploited.

VI. Conclusions



In this work, we studied opportunistic spectrum access for a single SU in a CR network with multiple PU channels. We 
formulated the problem as a partially observable Markov decision process, and examined the intricate tradeoffs in the optimal 
scheduling process, when incorporating the temporal correlation in both the channel fading and PU occupancy states. We 
modeled the channel fading variation with a two-state first-order Markov chain. The temporal correlation of PU traffic was 
modeled using age and a class of monotonically decreasing functions, which can have a long memory. The optimality of the 
simple greedy policy was established under certain conditions. For the general case, we individually studied the tradeoffs in 
the immediate reward and the total reward. Further, by developing a genie-aided system with full observation of the channel
fading feedback, we decomposed and characterized the multiple tiers of the intricate tradeoffs in the original system. Finally,
we numerically studied the performance of the two systems and showed that the original system achieved an optimal total 
reward very close (within 1%) to that of the genie-aided system. Further, the optimal policy in the original system significantly 
outperformed randomized scheduling, highlighting the merit of exploiting the temporal correlation in the system states. We 
believe that our formulation and the insights we have obtained open up new horizons in better understanding spectrum allocation 
in cognitive radio networks, with problem settings that go beyond the traditional ones.